AITopics | answer type

answer type

222d2eaf24cf8259a35d6c7130d31425-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Feb-19-2026, 04:32:20 GMT

arxiv preprint arxiv, benchmark, reasoning ability, (13 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(3 more...)

Genre: Research Report (0.92)

Industry:

Health & Medicine (0.68)
Education > Educational Setting (0.46)
Energy > Renewable (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Software > Programming Languages (0.92)
(2 more...)

Modulating early visual processing by language

Harm de Vries, Florian Strub, Jeremie Mary, Hugo Larochelle, Olivier Pietquin, Aaron C. Courville

Neural Information Processing Systems | Nov-21-2025, 10:51:51 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > France > Hauts-de-France > Pas-de-Calais (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.96)

QuAnTS: Question Answering on Time Series

Divo, Felix, Kraus, Maurice, Nguyen, Anh Q., Xue, Hao, Razzak, Imran, Salim, Flora D., Kersting, Kristian, Dhami, Devendra Singh

arXiv.org Artificial Intelligence | Nov-10-2025

Text offers intuitive access to information. This can, in particular, complement the density of numerical time series, thereby allowing improved interactions with time series models to enhance accessibility and decision-making. While the creation of question-answering datasets and models has recently seen remarkable growth, most research focuses on question answering (QA) on vision and text, with time series receiving minute attention. To bridge this gap, we propose a challenging novel time series QA (TSQA) dataset, QuAnTS, for Question Answering on Time Series data. Specifically, we pose a wide variety of questions and answers about human motion in the form of tracked skeleton trajectories. We verify that the large-scale QuAnTS dataset is well-formed and comprehensive through extensive experiments. Thoroughly evaluating existing and newly proposed baselines then lays the groundwork for a deeper exploration of TSQA using QuAnTS. Additionally, we provide human performances as a key reference for gauging the practical usability of such models. We hope to encourage future research on interacting with time series models through text, enabling better decision-making and more transparent systems.

large language model, machine learning, question answering, (22 more...)

arXiv.org Artificial Intelligence

2511.05124

Country:

North America > United States (1.00)
Europe (1.00)

Genre:

Research Report (1.00)
Workflow (0.68)
Overview (0.67)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)

222d2eaf24cf8259a35d6c7130d31425-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Oct-9-2025, 20:54:00 GMT

arxiv preprint arxiv, benchmark, reasoning ability, (14 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Oceania > New Zealand (0.04)
(3 more...)

Genre: Research Report (0.92)

Industry:

Health & Medicine (0.68)
Education > Curriculum > Subject-Specific Education (0.67)
Education > Educational Setting (0.46)
Energy > Renewable (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Software > Programming Languages (0.92)
(2 more...)

OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

Aggarwal, Pranjal, Kim, Seungone, Lanchantin, Jack, Welleck, Sean, Weston, Jason, Kulikov, Ilia, Saha, Swarnadeep

arXiv.org Artificial Intelligence | Oct-7-2025

Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple math and general queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks along with harder math problems. Using novel thinking-adjusted accuracy metrics, we extensively evaluate 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.

accuracy, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2508.13141

Country:

Europe (0.68)
Asia > Russia > Siberian Federal District (0.28)
Asia > Russia > Far Eastern Federal District (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (1.00)
Health & Medicine (1.00)
Media > Music (0.94)
Education (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Yan, Yuchen, Jiang, Jin, Ren, Zhenbang, Li, Yijun, Cai, Xudong, Liu, Yang, Xu, Xin, Zhang, Mengdi, Shao, Jian, Shen, Yongliang, Xiao, Jun, Zhuang, Yueting

arXiv.org Artificial Intelligence | Sep-26-2025

Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
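To make the idea of a reference-based reward concrete, here is a minimal sketch of a rule-based verifier that compares a model response against a reference answer. It is not the method from the paper: the function name `reference_reward`, the `Answer:` marker, and the numeric-equivalence rule are all assumptions chosen for illustration.

```python
from fractions import Fraction


def _as_number(text):
    """Parse text as an exact rational number, or return None if it is not one."""
    try:
        return Fraction(text.strip().replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return None


def reference_reward(response, reference, marker="Answer:"):
    """Return 1.0 if the final answer in `response` matches `reference`, else 0.0.

    Hypothetical convention: the final answer follows the last `marker`;
    if the marker is absent, the whole response is treated as the answer.
    """
    _, _, tail = response.rpartition(marker)
    candidate = tail.strip()

    # Numeric answers are compared by value, so "0.5" matches "1/2".
    cand_num, ref_num = _as_number(candidate), _as_number(reference)
    if cand_num is not None and ref_num is not None:
        return 1.0 if cand_num == ref_num else 0.0

    # Otherwise fall back to a case-insensitive string comparison.
    return 1.0 if candidate.casefold() == reference.strip().casefold() else 0.0


print(reference_reward("Half of one is a half. Answer: 0.5", "1/2"))
print(reference_reward("Answer: Paris", "paris"))
print(reference_reward("Answer: 3", "4"))
```

A rule of this kind fails on harmless variations such as a trailing period or a unit suffix, which is the sort of verifier inaccuracy a benchmark for reference-based reward systems is meant to expose.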

benchmark, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2505.15801

Country:

Asia (0.68)
North America > United States (0.46)
Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Wang, Weida, Huang, Dongchen, Li, Jiatong, Yang, Tengchao, Zheng, Ziyang, Zhang, Di, Han, Dong, Chen, Benteng, Luo, Binzhao, Liu, Zhiyu, Liu, Kunling, Gao, Zhiyuan, Geng, Shiqi, Ma, Wei, Su, Jiaming, Li, Xin, Pu, Shuchen, Shui, Yuhan, Cheng, Qianjia, Dou, Zhihao, Cui, Dongfei, He, Changyong, Zeng, Jin, Xie, Zeke, Su, Mao, Zhou, Dongzhan, Li, Yuqiang, Ouyang, Wanli, Cai, Yunqi, Dai, Xi, Zhang, Shufei, Bai, Lei, Cheng, Jinguang, Fang, Zhong, Weng, Hongming

arXiv.org Artificial Intelligence | Sep-1-2025

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches an average SEED score of only 36 and an accuracy of only 28% on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.18124

Country: Asia > China (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Setting > K-12 Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM-Assisted Translation Refinement

Hasan, Mohammed Rakibul, Majid, Rafi, Tahmid, Ahanaf

arXiv.org Artificial Intelligence | Aug-28-2025

In this paper, we introduce Bangla-Bayanno, an open-ended Visual Question Answering (VQA) dataset in Bangla, a widely used language that remains low-resource in multimodal AI research. The majority of existing datasets are either manually annotated with an emphasis on a specific domain, query type, or answer type, or are constrained by niche answer formats. In order to mitigate human-induced errors and guarantee lucidity, we implemented a multilingual LLM-assisted translation refinement pipeline. This dataset overcomes the issues of low-quality translations from multilingual sources. The dataset comprises 52,650 question-answer pairs across 4750+ images. Questions are classified into three distinct answer types: nominal (short descriptive), quantitative (numeric), and polar (yes/no). Bangla-Bayanno provides the most comprehensive open-source, high-quality VQA benchmark in Bangla, aiming to advance research in low-resource multimodal learning and facilitate the development of more inclusive AI systems.
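The three answer types named in the abstract (nominal, quantitative, polar) can be illustrated with a small heuristic tagger. This is only a sketch: the rules, the yes/no vocabulary, and the names `AnswerType`, `classify_answer`, and `type_distribution` are assumptions for illustration, not the dataset's actual labeling procedure.

```python
import re
from collections import Counter
from enum import Enum


class AnswerType(Enum):
    NOMINAL = "nominal"            # short descriptive answer
    QUANTITATIVE = "quantitative"  # numeric answer
    POLAR = "polar"                # yes/no answer


# Assumed yes/no vocabulary (English and Bangla); not taken from the dataset.
POLAR_ANSWERS = {"yes", "no", "হ্যাঁ", "না"}

# \d matches any Unicode decimal digit in Python 3, including Bengali numerals.
NUMBER_PATTERN = re.compile(r"[+-]?\d+(\.\d+)?")


def classify_answer(answer: str) -> AnswerType:
    """Assign one of the three answer types using simple surface rules."""
    text = answer.strip().lower()
    if text in POLAR_ANSWERS:
        return AnswerType.POLAR
    if NUMBER_PATTERN.fullmatch(text):
        return AnswerType.QUANTITATIVE
    # Everything else, including spelled-out numbers, falls back to nominal.
    return AnswerType.NOMINAL


def type_distribution(answers):
    """Count how many answers fall into each answer type."""
    return Counter(classify_answer(a) for a in answers)


print(type_distribution(["no", "3", "a red bus", "7.5"]))
```

A tally like this is a quick way to check how balanced a VQA split is across answer types before reporting per-type accuracy.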

large language model, machine learning, translation, (18 more...)

arXiv.org Artificial Intelligence

2508.19887

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

Ho, Xanh, Huang, Jiahao, Boudin, Florian, Aizawa, Akiko

arXiv.org Artificial Intelligence | Apr-23-2025

Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. Following their success, large language models (LLMs) have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human judgments improves significantly, from 0.22 (EM) and 0.40 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of the QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.
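For readers unfamiliar with the two baseline metrics, the sketch below computes Exact Match and token-level F1 in the way they are commonly defined for extractive QA. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention and are an assumption here; the paper's own scoring script may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A correct but more verbose answer is penalized by both metrics.
print(exact_match("in Paris, France", "Paris"))  # False
print(f1_score("in Paris, France", "Paris"))     # 0.5
```

The last two calls show the failure mode the abstract describes: a prediction a human would accept scores 0 on EM and only 0.5 on F1 because it contains extra tokens.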

computational linguistic, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2504.11972

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (0.93)
Education > Assessment & Standards > Student Performance (0.75)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Where is this coming from? Making groundedness count in the evaluation of Document VQA models

Nourbakhsh, Armineh, Parekh, Siddharth, Shetty, Pranav, Jin, Zhao, Shah, Sameena, Rose, Carolyn

arXiv.org Artificial Intelligence | Mar-24-2025

Document Visual Question Answering (VQA) models have evolved at an impressive rate over the past few years, coming close to or matching human performance on some benchmarks. We argue that common evaluation metrics used by popular benchmarks do not account for the semantic and multimodal groundedness of a model's outputs. As a result, hallucinations and major semantic errors are treated the same way as well-grounded outputs, and the evaluation scores do not reflect the reasoning capabilities of the model. In response, we propose a new evaluation methodology that accounts for the groundedness of predictions with regard to the semantic characteristics of the output as well as the multimodal placement of the output within the input document. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences. We validate our scoring methodology using human judgment and show its potential impact on existing popular leaderboards. Through extensive analyses, we demonstrate that our proposed method produces scores that are a better indicator of a model's robustness and tends to give higher rewards to better-calibrated answers.

benchmark, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2503.1912

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
